
DAOS-16979 control: Reduce frequency of hugepage allocation at runtime #15848

Open

tanabarr wants to merge 4 commits into master
Conversation

@tanabarr (Contributor) commented Feb 5, 2025

Reduce the frequency of hugepage allocation change requests made to
the kernel during daos_server start-up. Check the total number of hugepages on
start and only request more from the kernel if the recommended number,
calculated from the server config file content, is greater than the existing
system total.

On the first start of the server process after a reboot, allocate an
arbitrarily large number, e.g. enough for 16*2 engine targets, the ambition
being to allocate once and reduce the chance of fragmentation.
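
A minimal sketch of that start-up decision, for illustration only (this is not the PR's code; `pagesPerTarget` and the other names are hypothetical, and the per-target page count is an assumed value):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// hugepagesTotal reads HugePages_Total from /proc/meminfo, the kernel's
// current system-wide hugepage count.
func hugepagesTotal() (int, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 && fields[0] == "HugePages_Total:" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, fmt.Errorf("HugePages_Total not found in /proc/meminfo")
}

func main() {
	// Hypothetical sizing: assume a fixed page budget per engine target
	// and the "16*2 engine targets" first-start allowance described above.
	const pagesPerTarget = 512
	const largeTargetCount = 16 * 2
	recommended := largeTargetCount * pagesPerTarget

	total, err := hugepagesTotal()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if total >= recommended {
		// Existing system total already covers the recommendation:
		// skip the kernel request entirely, reducing fragmentation risk.
		fmt.Println("skip hugepage allocation, no change required")
		return
	}
	fmt.Printf("requesting %d hugepages (system currently has %d)\n",
		recommended, total)
}
```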

Test-tag: pr daily_regression
Allow-unstable-test: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Test-tag-hw-medium: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr added the bug, control-plane (work on the management infrastructure of the DAOS Control Plane) and go (Pull requests that update Go code) labels Feb 5, 2025
@tanabarr self-assigned this Feb 5, 2025
github-actions bot commented Feb 5, 2025

Ticket title is 'Mitigation against hugepage memory fragmentation'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-16979

@tanabarr (Contributor Author)

@phender as discussed, in order to try to reproduce the DMA grow failure (DAOS-16979) related to hugepage fragmentation, I ran this PR with Test-tag-hw-medium: pr daily_regression to try to trigger the failure. Unfortunately https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15848/1/pipeline/ didn't hit the issue. Any other ideas on how to get a baseline to prove a fix? Or any other approaches, like just landing a fix and seeing if it has the desired result? I'm going to try to generate a local reproducer as well.

Test-tag: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr marked this pull request as ready for review February 26, 2025 11:52
@tanabarr tanabarr requested review from a team as code owners February 26, 2025 11:52
@mjmac (Contributor) left a comment


If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.

If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.

	minHugepages, maxHugepages, cfgTargetCount, largeTargetCount, msgSysXS)

if minHugepages > maxHugepages {
	log.Debugf("config hugepage requirements exceed normal maximum")
Contributor:

Should this be logged at NOTICE level? Who is it for?

Contributor Author:

This is just a debug message because the user doesn't need to do anything about it; it indicates that the configuration requires more hugepages than the normal maximum.

	return FaultConfigHugepagesDisabledWithBdevs
}
if minHugepages != 0 {
	log.Noticef("hugepages disabled but targets will be assigned to bdevs, " +
Contributor:

I don't disagree with doing something here, but logging "caution is advised" is not particularly helpful, IMO. Is it an error or not? What is the admin supposed to do if/when they happen to notice this message in the server log?

Contributor Author:

This is to indicate that the server is operating in an unusual mode; the administrator should be aware of that.

@tanabarr (Contributor Author)

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
>
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.

I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.

@daosbuild1 (Collaborator)

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1420/log

@daltonbohning (Contributor)

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
>
> I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.

FWIW I've seen similar on Aurora after a fresh reboot:
https://daosio.atlassian.net/browse/DAOS-16921?focusedCommentId=135440
And I've only seen that with master, not 2.6.

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1565/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1518/log

@mjmac (Contributor) commented Feb 27, 2025

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
>
> I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.
>
> FWIW I've seen similar on Aurora after a fresh reboot: https://daosio.atlassian.net/browse/DAOS-16921?focusedCommentId=135440 And I've only seen that with master, not 2.6.

I don't doubt that there is a problem... My concern is more that it seems like the actual problem is not yet understood, and the proposed approach in this PR is a potential solution for a very specific set of scenarios. Adding a hard-coded configuration for hugepages kind of defeats the purpose of having a configuration mechanism, and it seems likely to cause unintended problems for configurations that are outside of what's being hard-coded in this PR.

@NiuYawei (Contributor)

> If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
> If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
>
> I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.
>
> FWIW I've seen similar on Aurora after a fresh reboot: https://daosio.atlassian.net/browse/DAOS-16921?focusedCommentId=135440 And I've only seen that with master, not 2.6.
>
> I don't doubt that there is a problem... My concern is more that it seems like the actual problem is not yet understood, and the proposed approach in this PR is a potential solution for a very specific set of scenarios. Adding a hard-coded configuration for hugepages kind of defeats the purpose of having a configuration mechanism, and it seems likely to cause unintended problems for configurations that are outside of what's being hard-coded in this PR.

Yes, there could be other unknown issues to be solved (as @daltonbohning mentioned, the allocation failure was seen after a fresh reboot, when memory isn't supposed to be fragmented), but allocating hugepages at run time (setting nr_hugepages) is believed to be a likely source of fragmentation.

I think our goal is to avoid allocating hugepages at run time when possible, whether on a production or a testing system.
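
For reference, the kernel knob in question is vm.nr_hugepages, exposed at /proc/sys/vm/nr_hugepages. A hedged grow-only sketch (illustrative, not this PR's code; ensureHugepages is a hypothetical helper, and writing the file requires root):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const nrHugepagesPath = "/proc/sys/vm/nr_hugepages"

// currentNrHugepages reads the kernel's current hugepage pool size.
func currentNrHugepages() (int, error) {
	b, err := os.ReadFile(nrHugepagesPath)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}

// ensureHugepages grows the pool to at least want pages but never shrinks
// it, so repeated server starts don't keep resizing the pool at run time.
func ensureHugepages(want int) error {
	have, err := currentNrHugepages()
	if err != nil {
		return err
	}
	if have >= want {
		return nil // pool already large enough: no kernel request made
	}
	return os.WriteFile(nrHugepagesPath, []byte(strconv.Itoa(want)), 0644)
}

func main() {
	if err := ensureHugepages(1024); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```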

Test-tag: pr daily_regression
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
	return nil
} else if minHugepages == 0 {
	// Enable minimum needed for scanning NVMe on host in discovery mode.
	if cfg.NrHugepages < scanMinHugepageCount && mi.HugepagesTotal < scanMinHugepageCount {
Contributor:

I am probably missing something, but I was thinking that mi.HugepagesTotal was the number of available hugepages, and thus cfg.NrHugepages could not be greater than this first value.

// allocate on numa node 0 (for example if a bigger number of hugepages are
// required in discovery mode for an unusually large number of SSDs).
prepReq.HugepageCount = srv.cfg.NrHugepages
srv.log.Debugf("skip allocating hugepages, no change is required")
Contributor:

Should it not be an error to have bdevs without hugepages?

Contributor Author:

After chatting with @NiuYawei, we decided that we should support emulated NVMe with or without hugepages, as some usage models may not require them.

@knard38 (Contributor) Feb 28, 2025

Makes sense.
Nit: the error FaultConfigHugepagesDisabledWithBdevs raised at line 568 should probably be renamed to something such as FaultConfigHugepagesDisabledWithNvmeBdevs.

@daosbuild1 (Collaborator)

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1220/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1429/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1522/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1569/log

Labels: bug, control-plane (work on the management infrastructure of the DAOS Control Plane), go (Pull requests that update Go code)

6 participants